The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon
Abstract
We present an overview of the Index Thomisticus Treebank project (IT-TB). The IT-TB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11-million-token Latin corpus of the texts of Thomas Aquinas. We briefly describe the annotation guidelines, which are shared with the Latin Dependency Treebank (LDT). We report on the application of data-driven dependency parsers to IT-TB and LDT data, present training and parsing results on several datasets, and evaluate learning algorithms and techniques. Furthermore, we introduce the IT-TB valency lexicon extracted from the treebank, report quantitative data on the lexicon, and provide some statistical measures of its subcategorisation structures.
Similar resources
The Development of the "Index Thomisticus" Treebank Valency Lexicon
We present a valency lexicon for Latin verbs extracted from the Index Thomisticus Treebank, a syntactically annotated corpus of Medieval Latin texts by Thomas Aquinas. In our corpus-based approach, the lexicon reflects the empirical evidence of the source data. Verbal arguments are induced directly from annotated data. The lexicon contains 432 Latin verbs with 270 valency frames. The lexicon is...
Selectional Preferences from a Latin Treebank
We present a system for automatically acquiring selectional preferences for Latin verbs. We use the Index Thomisticus Treebank Valency Lexicon and an enriched version of Latin WordNet as the reference conceptual hierarchy.
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
The Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin
The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of co...
Building a Bilingual ValLex Using Treebank Token Alignment: First Observations
In this paper we explore the potential and limitations of a concept of building a bilingual valency lexicon based on the alignment of nodes in a parallel treebank. Our aim is to build an electronic Czech↔English Valency Lexicon by collecting equivalences from bilingual treebank data and storing them in two already existing electronic valency lexicons, PDT-VALLEX and Engvallex. For this task a s...
Journal: TAL
Volume: 50
Publication date: 2009